Session 10: Conclusion

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-08-02

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interfaces (APIs)
9 Interactive Web Pages
10 Conclusion

1. Introduction

1. Introduction: What we learned about

  • Using Quarto and RStudio projects in the course
  • Packages and functions in R
  • How to use the R help docs and other ways to learn more
  • Functions, data, loops and if statements in R
  • Tidyverse vs. base R and the pipe |>
  • Literate programming
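As a brief reminder of the pipe, here is a minimal sketch in base R. It assumes R 4.1 or later (where the native pipe |> is available); the vector x is made up for illustration.

```r
# Hypothetical example values
x <- c(4, 9, 16)

# Nested base R call: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps with the native pipe: read from left to right
piped <- x |> sqrt() |> mean() |> round(1)

# Both ways of writing the code give the same result
identical(nested, piped)
```

The pipe does not change what is computed, only the order in which the code is read.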

2. Data Structures and Wrangling

2. Data Structures and Wrangling: What we learned about

  • how data plays into the research process
  • the difference between content and structure of data
  • the basic data structures in R and what they are good for
  • how to turn information into data
  • the key role of tables
  • and how to turn bad data structures into good tables

Zane Lee via unsplash.com

3. Working with Files

3. Working with Files: What we learned about

In this session, you learned:

  • how to use files efficiently and how to solve problems using files
  • good practices for transparent and efficient file usage
  • how to work with many files at the same time
  • and how you can facilitate collaborative working with files
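Working with many files at once can be sketched as follows. This is only one possible approach in base R; the folder name, file names and values are made up, and the files are written to a temporary directory so the example is self-contained.

```r
# Write three small example files to a temporary folder (made-up data)
dir <- file.path(tempdir(), "many_files_demo")
dir.create(dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(
    data.frame(id = i, value = i * 10),
    file.path(dir, paste0("part_", i, ".csv")),
    row.names = FALSE
  )
}

# Find all csv files, read each one, and stack them into one table
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read.csv))
nrow(all_data)
```

The same pattern (list the files, apply one reading function to each, combine the results) carries over to other file types.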

JF Martin via unsplash.com

4. Linking and joining data & SQL

4. Linking and joining data & SQL: What we learned about

In this session, you learned:

  • why and how to work with relational data
  • how to join data from different tables in R
  • how to join data from different tables in SQL
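A join can be sketched with two tiny tables. The tables people and visits are hypothetical, and base R's merge() is used here only to keep the example free of extra packages; dplyr::left_join() is the tidyverse counterpart. The SQL in the comment is a rough equivalent, not a tested query.

```r
# Two hypothetical tables that share the key column "id"
people <- data.frame(id = c(1, 2, 3), name = c("Ada", "Ben", "Cleo"))
visits <- data.frame(id = c(1, 1, 3), page = c("home", "about", "home"))

# Left join: keep every row of people, add matching rows from visits.
# Roughly equivalent SQL:
#   SELECT * FROM people LEFT JOIN visits USING (id);
joined <- merge(people, visits, by = "id", all.x = TRUE)

# People without a visit are kept, with NA in the page column
joined
```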

Via DALL-E

5. Scaling, Reporting and Database Software

5. Scaling, Reporting and Database Software: What we learned about

In this session, we covered:

  • Recap: DBMS
  • Working with PostgreSQL
  • Working with text databases
  • Benchmarking
  • Final scaling tips

Nik via unsplash.com

6. Introduction to the Web

6. Introduction to the Web: What we learned about

In this session, we learned how to scout data in the wild. We set out to:

  • discuss web scraping from a theoretical point of view:
    • What is web scraping?
    • Why should you learn it?
    • What legal and ethical implications should you keep in mind?
  • learn a bit more about how the Internet works
    • What is HTML?
    • What is CSS?

Angie Gade via unsplash.com

7. Static Web Pages

7. Static Web Pages: What we learned about

In this session, we trapped some docile data that wants to be found. We set out to:

  • Go over some parsing examples:
    • Wikipedia: World Happiness Report
  • Discuss some examples of good approaches to data wrangling
  • Go into a bit more detail on requesting raw data
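Parsing a static page can be sketched with the rvest package. The HTML below is a tiny made-up page standing in for a real website, so no internet connection is needed; the selectors "table" and "h1" only fit this toy page.

```r
library(rvest)

# A tiny made-up page standing in for a real static website
page <- minimal_html('
  <h1>Example table</h1>
  <table>
    <tr><th>country</th><th>score</th></tr>
    <tr><td>A</td><td>1.5</td></tr>
    <tr><td>B</td><td>2.5</td></tr>
  </table>
')

# Select the first table with a CSS selector and convert it to a data frame
tab <- page |> html_element("table") |> html_table()

# Extract the text of the heading
title <- page |> html_element("h1") |> html_text2()
```

For a real page, read_html() with the page URL would take the place of minimal_html().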

Original Image Source: prowebscraper.com

Joe Caione via unsplash.com

8. Application Programming Interfaces (APIs)

8. Application Programming Interfaces (APIs): What we learned about

In this session, we learned how to adopt data from someone else. We set out to:

  • Learn what an API is and what parts it consists of
  • Learn about httr2, a modern intuitive package to communicate with APIs
  • Discuss some examples:
    • A simple first API: The Guardian API
    • UK Parliament API
    • Semantic Scholar API
  • Go into a bit more detail on requesting raw data
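Building a request with httr2 can be sketched as below. The endpoint, the query parameters and the placeholder key YOUR_KEY are assumptions for illustration; the request is only built, not sent, so the code runs without a key or an internet connection.

```r
library(httr2)

# Build (but do not send) a request; endpoint and parameters are assumptions
req <- request("https://content.guardianapis.com/search") |>
  req_url_query(q = "web scraping", `api-key` = "YOUR_KEY")

# Inspect the full URL that would be requested
req$url

# Sending it would look like this (needs a valid key and internet access):
# resp <- req_perform(req)
# results <- resp_body_json(resp)
```

Building the request first and performing it in a separate step makes it easier to check the URL before any data is requested.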

Original Image Source: prowebscraper.com

9. Interactive Web Pages

9. Interactive Web Pages: What we learned about

In this session, we learned how to hunt down wild data. We set out to:

  • Learn how to find secret APIs
  • Emulate a browser
  • Focus specifically on step 1 below

Original Image Source: prowebscraper.com

Philipp Pilz via unsplash.com

Now it’s your turn!